AITopics | dataset creator

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question "What is Artificial Intelligence?" and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

605bbd006beee7e0589a51d6a50dcae1-Supplemental-Datasets_and_Benchmarks_Track.pdf

Eshta Bhardwaj

Neural Information Processing Systems | Feb-15-2026, 02:37:30 GMT

data mining, machine learning, natural language, (16 more...)

Neural Information Processing Systems

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > New York > New York County > New York City (0.04)
Asia > Japan > Honshū > Kantō > Kanagawa Prefecture > Yokohama (0.04)
(12 more...)

Genre:

Workflow (0.67)
Overview (0.67)
Research Report > New Finding (0.45)

Industry:

Information Technology (1.00)
Health & Medicine (1.00)
Energy (1.00)
(3 more...)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (1.00)
(4 more...)

A for FLAIR

Neural Information Processing Systems | Feb-12-2026, 21:38:17 GMT

Unqualified images are removed as described in Appendix A.3. Was the "raw" data saved in addition to the preprocessed/cleaned/labeled data (e.g., to

artificial intelligence, dataset, machine learning, (18 more...)

Neural Information Processing Systems

Country: North America > United States (0.14)

Industry:

Law (1.00)
Government (0.68)
Information Technology > Security & Privacy (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

The State of Data Curation at NeurIPS: An Assessment of Dataset Development Practices in the Datasets and Benchmarks Track. Appendix A. Rubric

Eshta Bhardwaj

Neural Information Processing Systems | Oct-10-2025, 04:11:42 GMT

A Spoken Language Dataset of Descriptions for Speech-Based Grounded Language Learning.

data curation, dataset, neural information processing system, (10 more...)

Neural Information Processing Systems

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > New York > New York County > New York City (0.04)
Asia > Japan > Honshū > Kantō > Kanagawa Prefecture > Yokohama (0.04)
(12 more...)

Genre:

Workflow (0.67)
Overview (0.67)
Research Report > New Finding (0.45)

Industry:

Information Technology (1.00)
Health & Medicine (1.00)
Energy (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
(2 more...)

A for FLAIR

Neural Information Processing Systems | Aug-19-2025, 20:32:44 GMT

Unqualified images are removed as described in Appendix A.3. Was the "raw" data saved in addition to the preprocessed/cleaned/labeled data (e.g., to

artificial intelligence, dataset, machine learning, (18 more...)

Neural Information Processing Systems

Country: North America > United States (0.14)

Industry:

Law (1.00)
Government (0.68)
Information Technology > Security & Privacy (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Building Better Datasets: Seven Recommendations for Responsible Design from Dataset Creators

Orr, Will, Crawford, Kate

arXiv.org Artificial Intelligence | Aug-30-2024

The increasing demand for high-quality datasets in machine learning has raised concerns about the ethical and responsible creation of these datasets. Dataset creators play a crucial role in developing responsible practices, yet their perspectives and expertise have not yet been highlighted in the current literature. In this paper, we bridge this gap by presenting insights from a qualitative study that included interviewing 18 leading dataset creators about the current state of the field. We shed light on the challenges and considerations faced by dataset creators, and our findings underscore the potential for deeper collaboration, knowledge sharing, and collective development. Through a close analysis of their perspectives, we share seven central recommendations for improving responsible dataset creation, including issues such as data quality, documentation, privacy and consent, and how to mitigate potential harms from unintended use cases. By fostering critical reflection and sharing the experiences of dataset creators, we aim to promote responsible dataset creation practices and develop a nuanced understanding of this crucial but often undervalued aspect of machine learning research.

creator, dataset, interview, (14 more...)

arXiv.org Artificial Intelligence

2409.00252

Country:

North America > United States > California (0.28)
North America > United States > New York > New York County > New York City (0.04)
Oceania > Australia (0.04)
(7 more...)

Genre: Research Report > New Finding (0.86)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Quality (1.00)
Information Technology > Communications (1.00)
(4 more...)

Technical Debt In Machine Learning System - A Model Driven Perspective - DataScienceCentral.com

#artificialintelligence | Sep-15-2022, 13:41:32 GMT

This article is part 2 of a two-part series on Technical Debt in Machine Learning Systems development. It introduces a simple yet powerful Model of Technical Debt for Machine Learning Systems. The model is simple to remember, easy to extend, and provides a reliable means of building maintainable Machine Learning Systems. This, in a nutshell, is the value proposition of the post. The Model has four dimensions: Time, Input, System, and Output.

learning system, machine learning system, technical debt, (12 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Understanding Machine Learning Practitioners' Data Documentation Perceptions, Needs, Challenges, and Desiderata

Heger, Amy K., Marquis, Liz B., Vorvoreanu, Mihaela, Wallach, Hanna, Vaughan, Jennifer Wortman

arXiv.org Artificial Intelligence | Aug-24-2022

Data is central to the development and evaluation of machine learning (ML) models. However, the use of problematic or inappropriate datasets can result in harms when the resulting models are deployed. To encourage responsible AI practice through more deliberate reflection on datasets and transparency around the processes by which they are created, researchers and practitioners have begun to advocate for increased data documentation and have proposed several data documentation frameworks. However, there is little research on whether these data documentation frameworks meet the needs of ML practitioners, who both create and consume datasets. To address this gap, we set out to understand ML practitioners' data documentation perceptions, needs, challenges, and desiderata, with the goal of deriving design requirements that can inform future data documentation frameworks. We conducted a series of semi-structured interviews with 14 ML practitioners at a single large, international technology company. We had them answer a list of questions taken from datasheets for datasets (Gebru, 2021). Our findings show that current approaches to data documentation are largely ad hoc and myopic in nature. Participants expressed needs for data documentation frameworks to be adaptable to their contexts, integrated into their existing tools and workflows, and automated wherever possible. Despite the fact that data documentation frameworks are often motivated from the perspective of responsible AI, participants did not make the connection between the questions that they were asked to answer and their responsible AI implications. In addition, participants often had difficulties prioritizing the needs of dataset consumers and providing information that someone unfamiliar with their datasets might need to know. Based on these findings, we derive seven design requirements for future data documentation frameworks.

data documentation, dataset, participant, (14 more...)

arXiv.org Artificial Intelligence

2206.02923

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Wisconsin > Milwaukee County > Milwaukee (0.04)
North America > United States > Washington > King County > Redmond (0.04)
(5 more...)

Genre:

Research Report > New Finding (1.00)
Questionnaire & Opinion Survey (1.00)

Industry:

Law (1.00)
Information Technology (1.00)
Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (1.00)

ShortcutLens: A Visual Analytics Approach for Exploring Shortcuts in Natural Language Understanding Dataset

Jin, Zhihua, Wang, Xingbo, Cheng, Furui, Sun, Chunhui, Liu, Qun, Qu, Huamin

arXiv.org Artificial Intelligence | Aug-16-2022

Benchmark datasets play an important role in evaluating Natural Language Understanding (NLU) models. However, shortcuts -- unwanted biases in the benchmark datasets -- can damage the effectiveness of benchmark datasets in revealing models' real capabilities. Since shortcuts vary in coverage, productivity, and semantic meaning, it is challenging for NLU experts to systematically understand and avoid them when creating benchmark datasets. In this paper, we develop a visual analytics system, ShortcutLens, to help NLU experts explore shortcuts in NLU benchmark datasets. The system allows users to conduct multi-level exploration of shortcuts. Specifically, Statistics View helps users grasp the statistics such as coverage and productivity of shortcuts in the benchmark dataset. Template View employs hierarchical and interpretable templates to summarize different types of shortcuts. Instance View allows users to check the corresponding instances covered by the shortcuts. We conduct case studies and expert interviews to evaluate the effectiveness and usability of the system. The results demonstrate that ShortcutLens supports users in gaining a better understanding of benchmark dataset issues through shortcuts, inspiring them to create challenging and pertinent benchmark datasets.

machine learning, natural language, shortcut, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/TVCG.2023.3236380

2208.0801

Country:

Asia > China > Hong Kong (0.04)
North America > United States > New York > Suffolk County > Stony Brook (0.04)
Europe > Ireland (0.04)
(3 more...)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Natural Language > Understanding (0.61)

Datasheets for Datasets

Communications of the ACM | Nov-20-2021, 06:55:26 GMT

Data plays a critical role in machine learning. Every machine learning model is trained and evaluated using data, quite often in the form of static datasets. The characteristics of these datasets fundamentally influence a model's behavior: a model is unlikely to perform well in the wild if its deployment context does not match its training or evaluation datasets, or if these datasets reflect unwanted societal biases. Mismatches like this can have especially severe consequences when machine learning models are used in high-stakes domains, such as criminal justice [1, 13, 24], hiring [19], critical infrastructure [11, 21], and finance [18]. Even in other domains, mismatches may lead to loss of revenue or public relations setbacks.

dataset, dataset creator, datasheet, (14 more...)

Communications of the ACM

AI-Alerts: 2021 > 2021-11 > AAAI AI-Alert for Nov 23, 2021 (1.00)

Country:

North America > United States > Washington > King County > Seattle (0.14)
North America > United States > Maryland > Prince George's County > College Park (0.14)
North America > United States > New York > New York County > New York City (0.05)
(5 more...)

Industry:

Law (1.00)
Information Technology > Security & Privacy (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)
